Abstract

We describe a project to capitalize on newly available levels of computational resources in order to un- 
derstand human cognition. We will build an integrated physical system including vision, sound input and 
output, and dextrous manipulation, all controlled by a continuously operating large scale parallel MIMD 
computer. The resulting system will learn to "think" by building on its bodily experiences to accomplish 
progressively more abstract tasks. Past experience suggests that in attempting to build such an integrated 
system we will have to fundamentally change the way artificial intelligence, cognitive science, linguistics, 
and philosophy think about the organization of intelligence. We expect to be able to better reconcile the 
theories that will be developed with current work in neuroscience. 
1 Project Overview

We propose to build an integrated physical humanoid 
robot including active vision, sound input and output, 
dextrous manipulation, and the beginnings of language, 
all controlled by a continuously operating large scale par- 
allel MIMD computer. This project will capitalize on 
newly available levels of computational resources in or- 
der to meet two goals: an engineering goal of building 
a prototype general purpose flexible and dextrous au- 
tonomous robot and a scientific goal of understanding 
human cognition. While there have been previous at- 
tempts at building kinematically humanoid robots, none 
have attempted the embodied construction of an au- 
tonomous intelligent robot; the requisite computational 
power simply has not previously been available. 

The robot will be coupled into the physical world with 
high bandwidth sensing and fast servo-controlled actua- 
tors, allowing it to interact with the world on a human 
time scale. A shared time scale will open up new possi- 
bilities for how humans use robots as assistants, as well 
as allowing us to design the robot to learn new behaviors 
under human feedback such as human manual guidance 
and vocal approval. One of our engineering goals is to 
determine the architectural requirements sufficient for an 
enterprise of this type. Based on our earlier work on mo- 
bile robots, our expectation is that the constraints may 
be different from those that are often assumed for large
scale parallel computers. If confirmed, such a conclusion
could have important impacts on the design of future 
sub-families of large machines. 

Recent trends in artificial intelligence, cognitive sci- 
ence, neuroscience, psychology, linguistics, and sociol- 
ogy are converging on an anti-objectivist, body-based 
approach to abstract cognition. Where traditional ap- 
proaches in these fields advocate an objectively speci- 
fiable reality — brain-in-a-box, independent of bodily 
constraints — these newer approaches insist that intelli- 
gence cannot be separated from the subjective expe- 
rience of a body. The humanoid robot provides the 
necessary substrate for a serious exploration of the 
subjectivist — body-based — hypotheses. 

There are numerous specific cognitive hypotheses that 
could be implemented in one or more of the humanoids 
that will be built during the five-year project. For ex- 
ample, we can vary the extent to which the robot is pro- 
grammed with an attentional preference for some images 
or sounds, and the extent to which the robot is pro- 
grammed to learn to selectively attend to environmental 
input as a by-product of goal attainment (e.g., success- 
ful manipulation of objects) or reward by humans. We 
can compare the behavioral result of constructing a hu- 
manoid around different hypotheses of cortical represen- 
tation, such as coincidence detection versus interpolat- 
ing memory versus sequence seeking in counter streams 
versus time-locked multi-regional retroactivation. In the 
later years of the project we can connect with theories 
of consciousness by demonstrating that humanoids de- 
signed to continuously act on immediate sensory data 
(as suggested by Dennett's multiple drafts model) show 
more human-like behavior than robots designed to con- 
struct an elaborate world model. 



The act of building and programming behavior-based 
robots will force us to face not only issues of interfaces 
between traditionally assumed modularities, but even 
the idea of modularity itself. By reaching across tradi- 
tional boundaries and tying together many sensing and 
acting modalities, we will quickly illuminate shortcom-
ings in the standard models, shedding light on assump-
tions that are widely shared but incorrect and that have
so far gone unrecognized.

2 Background: the power of enabling technology

An enabling technology — such as the brain that we will 
build — has the ability to revolutionize science. A recent 
example of the far-reaching effects of such technologi- 
cal advances is the field of mobile robotics. Just as the 
advent of cheap and accessible mobile robotics dramat- 
ically altered our conceptions of intelligence in the last 
decade, we believe that current high-performance com- 
puting technology makes the present an opportune time 
for the construction of a similarly significant integrated 
intelligent system. 

Over the last eight years there has been a renewed 
interest in building experimental mobile robot systems 
that operate in unadorned and unmodified natural and 
unstructured environments. The enabling technology for 
this was the single chip micro-computer. This made it 
possible for relatively small groups to build serviceable 
robots largely with graduate student power, rather than 
the legion of engineers that had characterized earlier ef- 
forts along these lines in the late sixties. The accessibil- 
ity of this technology inspired academic researchers to 
take seriously the idea of building systems that would 
work in the real world. 

The act of building and programming behavior-based 
robots fundamentally changed our understanding of 
what is difficult and what is easy. The effects of this 
work on traditional artificial intelligence can be seen in 
innumerable areas. Planning research has undergone 
a major shift from static planning to deal with "reac- 
tive planning." The emphasis in computer vision has 
moved from recovery from single images or canned se- 
quences of images to active — or animate — vision, where 
the observer is a participant in the world controlling the 
imaging process in order to simplify the processing re- 
quirements. Generally, the focus within AI has shifted 
from centralized systems to distributed systems. Fur- 
ther, the work on behavior-based mobile robots has also 
had a substantial effect on many other fields (e.g., on the 
design of planetary science missions, on silicon micro- 
machining, on artificial life, and on cognitive science). 
There has also been considerable interest from neuro- 
science circles, and we are just now starting to see some 
bi-directional feedback there. 

The grand challenge that we wish to take up is to 
make the quantum leap from experimenting with mo- 
bile robot systems to an almost humanoid integrated 
head system with saccading foveated vision, facilities 
for sound processing and sound production, and a com-
pliant, dextrous manipulator. The enabling technol-
ogy is massively parallel computing; our brain will have
large numbers of processors dedicated to particular sub-
functions, and interconnected by a fixed topology net-
work.

3 Scientific Questions 

Building an android, an autonomous robot with hu- 
manoid form, has been a recurring theme in science fic- 
tion from the inception of the genre with Frankenstein, 
through the moral dilemmas infesting positronic brains, 
the human but not really human C-3PO and the ever
present desire for real humanness as exemplified by Com- 
mander Data. Their bodies have ranged from that of a 
recycled actual human body through various degrees of 
mechanical sophistication to ones that are indistinguish- 
able (in the stories) from real ones. And perhaps the 
most human of all the imagined robots, HAL-9000, did 
not even have a body. 

While various engineering enterprises have modeled 
their artifacts after humans to one degree or another 
(e.g., WABOT-II at Waseda University and the space 
station tele-robotic servicer of Martin-Marietta) no one 
has seriously tried to couple human-like cognitive pro-
cesses to these systems. There has been an implicit, and
sometimes explicit, assumption, even from the days of
Turing (see Turing (1970) 1), that the ultimate goal of
artificial intelligence research was to build an android. 
There have been many studies relating brain models to 
computers (Berkeley 1949), cybernetics (Ashby 1956), 
and artificial intelligence (Arbib 1964), and along the 
way there have always been semi-popular scientific books 
discussing the possibilities of actually building real 'live' 
androids (Caudill (1992) is perhaps the most recent). 

This proposal concerns a plan to build a series of 
robots that are humanoid in form, humanoid in
function, and to some extent humanoid in computational 
organization. While one cannot deny the romance of 
such an enterprise we are realistic enough to know that 
we can but scratch the surface of just a few of the scien- 
tific and technological problems involved in building the 
ultimate humanoid given the time scale and scope of our 
proposal, and given the current state of our knowledge. 

The reason that we should try to do this at all is that 
for the first time there is plausibly enough computation 
available. High performance parallel computation gives 
us a new tool that those before us have not had avail- 
able and that our contemporaries have chosen not to use 
in such a grand attempt. Our previous experience in 
attempting to emulate much simpler organisms than hu- 
mans suggests that in attempting to build such systems 
we will have to fundamentally change the way artificial 
intelligence, cognitive science, psychology, and linguis- 
tics think about the organization of intelligence. As a 
result, some new theories will have to be developed. We 
expect to be better able to reconcile the new theories 
with current work in neuroscience. The primary bene- 
fits from this work will be in the striving, rather than in 
the constructed artifact. 



3.1 Minds 

The traditional approach taken in artificial intelligence 
to building intelligent programs has affectionately been 
dubbed 'Good Old Fashioned AI', or GOFAI (Haugeland
1985). It is epitomized in the modularity arguments of
Fodor (1983) and in the physical symbol system hypoth-
esis of Newell & Simon (1981). These approaches reduce
AI to the problem of constructing a brain-in-a-box sym- 
bolic manipulator which would act intelligently if given 
appropriate connection to a robot (or other perceptuo- 
motor system). Still further modularization leads to in- 
dependent work on such tasks as natural language pro- 
cessing, planning, learning, and commonsense reasoning 
(e.g., Allen, Hendler & Tate (1990), Hobbs & Moore
(1985) or Brachman & Levesque (1985)). We have ar-
gued (Brooks 1991a) that much of GOFAI was shaped
by the technological resources available to its researchers. 
High performance computing and communications gives 
us a new opportunity to re-shape attempts at building 
intelligent systems. 

Many modern theories are at odds with GOFAI. For 
example, Minsky (1986) suggests that the mind is a soci- 
ety of smaller agents competing and cooperating. Kins- 
bourne (1988) and Dennett (1991) argue that there is 
no place in the brain where consciousness resides. Lin- 
guists and psycholinguists have argued that the long- 
fashionable separation of language into the separate com- 
ponents of grammar and semantics is a fiction convenient 
for symbolic formulation but not useful for advancing our 
understanding of the real diversity of language phenom- 
ena (Langacker 1987), (Harris 1991). Brooks (1991a) 
proposes that human-level intelligence can be built with- 
out a single central representation of the world. Stein (to 
appear) argues that all of cognition can be seen as the 
recapitulation — through imagination — of action in the 
world. 

Many other theories of mind (e.g., Searle (1992), Edel- 
man (1987), Edelman (1989), Edelman (1992)) argue 
against the traditional AI notion of categorical represen- 
tation, and instead for a more situated model of compu- 
tation. Unfortunately these and others are flawed by fun- 
damental misunderstandings about the nature of com- 
putation and the uses of abstraction, usually centered 
around formal models of Turing machines and sometimes 
their interaction with Gödel's theorem. Such arguments
were long ago successfully debunked (Arbib 1964), but 
continue to resurface. 2 

At the other end of the spectrum is connectionism. 
Computational scientists have worked with simple ab- 
stractions of the brain for many years in two main 
waves, one in the sixties (Rosenblatt 1962), (Minsky
& Papert 1969) and a second in the eighties (Rumel-
hart & McClelland 1986). Unfortunately, most of this
work is concerned with local aspects of the problem,
rather than giving insight into how a complete system
might be organized. 3 There have been recent attempts
to bridge the gap in more serious ways between com-
putation and neuroscience (in particular Churchland &
Sejnowski (1992)), but still the gap is too large to con-
sider neural-based approaches for a system of the scope
we are proposing. Two of us have already been work-
ing together (Dennett & Kinsbourne 1992), relating a
neuroscientific theory of consciousness, dominant focus
(Kinsbourne 1988), to a philosophical analysis of mind.
A major intent of our proposed work is to extend that
analysis to the point of its being an implementable the-
ory on our humanoids.

1 Different sources cite 1947 and 1948 as the time of writ-
ing, but it was not published until long after his death.

2 A more egregious version of this is Penrose (1989), who
not only makes the same Turing-Gödel error, but then in a
desperate attempt to find the essence of mind and applying
the standard methodology of physics, namely to find a sim-
plifying underlying principle, resorts to an almost mystical
reliance on quantum mechanics.

Recent work in neuropsychology has produced sur-
prising results. Lesion studies, e.g. those by Damasio
& Damasio (1989) and McCarthy & Warrington (1990),
indicate that the modularity of storage and access in 
the human brain is dramatically different from what our 
intuitions — as exemplified by both cognitive science and 
GOFAI — tell us. For instance it is clear that a picture of 
a dolphin provides immediate access to a different set of 
representations at a different level of generalization from 
those prompted by the verbal stimulus, 'dolphin'. In a 
normal person these representations are cross-linked, but 
in patients with certain lesions these cross-links may be 
destroyed for particular classes of entities (e.g., for an-
imals, but not tools). 4 Likewise Newcombe & Ratcliff
(1989) demonstrate multiple parallel channels of control
dependent on the task, rather than, say, a single cen- 
tralized finger control module for each finger. There is 
a grounding of motor control in the different types of 
interactions the agent has with the world. 5 Nor is the 
control of attention centralized, as illustrated by studies 
of unilateral neglect (Kinsbourne 1987), but rather it is 
a matter of competition between brain systems. 

The argument is that the human brain stores things 
not only by category but also by modality — the 'repre- 
sentations' are grounded in the sensory modality used 
to learn the information. Kuipers & Byun (1991),
Mataric (1992b) and Stein (to appear) implement limited
forms of this body-based representation in mobile robots. 
Drescher (1991), too, uses environmental interaction to 
construct representation. Still, each of these projects 
was limited by the relative poverty of the sensory suite. 
In this project, we will use the neuropsychological evi-
dence to build a far more sophisticated instantiation of
the body-based theory of representation and to examine
it relative to traditional theories of modularity.

3 There are exceptions to this: for instance, the work of
Beer (1990); but that is restricted to insect level cognition.

4 One particular patient (McCarthy & Warrington 1988)
when shown a picture of a dolphin, was able to form sentences
using the word 'dolphin' and talk about its habitat, its ability
to be trained, and its role in the US military. When verbally
asked what a dolphin was, however, he thought it was 'either
a fish or a bird.' He had no such discrepancies in knowledge
when the subject was, for example, a wheelbarrow.

5 For instance, some patients cannot exercise conscious
control over their fingers for simple tasks, yet seem unim-
paired in threading a needle, or playing the piano. Further-
more in some cases selective drug induced suppression shows
ways in which many simple reflexes combine to give the ap-
pearance of a centralized will producing globally coherent
behavior (Philip Teitelbaum & Pellis 1990).

There is also evidence that what appear to be reason- 
ably well understood sensory channels within the brain 
are much more complex than we currently imagine. As
one example, there is the effect known as blindsight,
where despite the lack of pieces or a whole visual cor- 
tex, both humans and animals can perceive, perhaps 
not consciously, certain things within their visual field 
(Weiskrantz 1986), (Braddick, Atkinson, Hood, Hark-
ness & Faraneh Vargha-Khadem 1992). There has
been some recent argument that these phenomena may
be produced by partially intact visual cortex (Fendrich,
Wessinger & Gazzaniga 1992), but even that would still
call into question the arguments of Marr (1982) — long 
used in computer vision — that the purpose of the vision 
system is to reconstruct a 3-dimensional representation 
of what is out in the world. 

The notion that embodiment in the physical world is 
important to creating human-like intelligence is not at all 
new. Even the 1947 paper of Turing (1970) is quite con- 
cerned about this point. Later Simon (1969) discussed 
a similar point using as a parable an ant walking along 
the beach. He pointed out that the complexity of the 
behavior of the ant is more a reflection of the complex- 
ity of its environment than its own internal complexity 
and speculated that the same may be true of humans. 

The idea that our very modularity and internal orga-
nization depend on our ways of physically interacting
with the world is carried even further in a series of philo-
sophical arguments (Lakoff & Johnson 1980), (Lakoff
1987), (Johnson 1987). Their central hypothesis is that
all of our thought and language is grounded in physical
patterns generated in our sensory and motor sys-
tems as we interact with the world. In particular these
physical bases of our reason and intelligence can still 
be discerned in our language as we 'confront' the fact 
that much of our language can be 'viewed' as physical 
metaphors, 'based' on our own bodily interactions with 
the world. 

We plan on taking these notions seriously as we build 
and program our humanoids, using physical interactions 
as a basis for higher level cognitive-like behaviors. We 
have already demonstrated a simple version of these 
ideas using currently available "insect-level" robotics 
(Stein to appear). 

3.2 Symbols and Mental Representation 

The physical symbol system hypothesis maintains that 
any physical symbol system can implement intelligent 
behavior. As a consequence, it says that symbols provide 
a layer of abstraction that hides the details of perceptual 
and motor processes. 

To understand the difficulties that the physical sym- 
bol system hypothesis presents for our task, we might 
examine another similar abstraction. It is common to 
regard digital design as concerned solely with binary 
digits — discrete ones and zeros. Indeed, this digital ab- 
straction allows the use of boolean logic to synthesize 
the combinational circuits out of which our computa- 
tional elements are built. By hiding the details of analog 
voltages that constitute our systems, the digital abstrac- 



tion facilitates reasoning about and construction with 
these elements. However, the fact that the digital ab- 
straction is useful for combinational synthesis does not 
mean that it suffices for all purposes. Indeed, for certain 
elements — such as a bipolar switch — it may be necessary 
to look beneath the digital abstraction to understand the 
interactions of electrical components — e.g., to debounce 
the switch. Further, certain portions of the resulting 
system — such as the debouncing circuitry — may never 
be interpretable directly in terms of the digital abstrac- 
tion. 
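
The switch example above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the function names (digital_view, debounce, count_transitions), the 2.5 V threshold, the sample voltages, and the three-sample stability window are all hypothetical and do not come from this proposal. The sketch also debounces in software on already thresholded samples, whereas the text refers to debouncing circuitry; the point it shows is the same, namely that a naive digital reading of a bouncing contact reports transitions that never happened at the level of the switch.

```python
from typing import List, Optional, Sequence


def digital_view(voltages: Sequence[float], threshold: float = 2.5) -> List[int]:
    """Naive digital abstraction: threshold each analog sample on its own."""
    return [1 if v >= threshold else 0 for v in voltages]


def debounce(bits: Sequence[int], stable_samples: int = 3) -> List[int]:
    """Accept a level only after it has held for `stable_samples` samples.

    The first sample is taken as the initial level (an assumption of this
    sketch); later changes must persist before they are reported.
    """
    out: List[int] = []
    state: Optional[int] = None      # level currently reported
    run_value: Optional[int] = None  # level of the current run of samples
    run_length = 0                   # how long that run has lasted
    for b in bits:
        if b == run_value:
            run_length += 1
        else:
            run_value, run_length = b, 1
        if state is None or run_length >= stable_samples:
            state = run_value
        out.append(state)
    return out


def count_transitions(bits: Sequence[int]) -> int:
    """Number of level changes between consecutive samples."""
    return sum(1 for a, b in zip(bits, bits[1:]) if a != b)


# Hypothetical samples of a contact closing once, with mechanical bounce.
samples = [0.1, 0.2, 4.8, 0.3, 4.9, 0.4, 5.0, 4.9, 5.0, 5.0]
raw = digital_view(samples)
clean = debounce(raw)
print("raw transitions:", count_transitions(raw))
print("debounced transitions:", count_transitions(clean))
```

Only after the bounce has been dealt with does the single closing of the switch look like a single change from zero to one; the samples consumed while waiting for the level to settle have no counterpart on the digital side.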

Approaches that rely on the physical symbol-system 
hypothesis cannot constitute complete explanations of 
intelligence, precisely because they abstract away the de- 
tails of symbols' implementation. In order for a brain-in- 
a-box to connect to a body, all symbols must be derivable 
from sensory stimuli; but in addition, there are portions 
of the system — such as the bouncy switch — that can- 
not be seen from the symbolic side of the abstraction. 
Thus, while symbolic approaches to cognition may pro- 
vide us with tremendous insight as to how intelligence 
might work once we have symbols, they can neither tell us
how to construct those symbols nor assist us in the iden- 
tification and manipulation of the non-symbolic portion 
of our system. 

At the opposite extreme are several non-symbolic ap- 
proaches to cognition. From connectionism to reactive 
systems to artificial life, these systems operate on stimuli 
much closer to "real" sensory input, often using difficult- 
to-comprehend processes to compute appropriate actions 
based on these stimuli. Because they are closer to actual 
sensation, these approaches have had marked success in 
certain areas (e.g., video-game playing (Agre & Chap-
man 1987); navigation (Pomerleau 1991); "insect" intel-
ligence (Connell 1990), (Angle & Brooks 1990)). How-
ever, because they lack symbols or any comparable ab-
straction, these systems are often inscrutable. A corol-
lary is the difficulty that practitioners have had in trans-
ferring knowledge gained in the construction of one sys-
tem to the design of the next. Because there is little
explicit structure, these systems generally defy descrip- 
tion by abstraction. 

We believe that the most fruitful approach will be one 
that builds on both of these traditions (e.g., Rosenschein 
& Kaelbling (1986), Kuipers & Byun (1991), Drescher
(1991), Stein (to appear), Yanco & Stein (1993)). Just as
the digital abstraction is useful for the designer of combi- 
national circuits, so the symbolic abstraction will be in- 
valuable for the designer of cognitive components. How- 
ever, combinational circuits are built out of raw voltages, 
not out of ones and zeros: the binary digits are in the 
mind of the designer. Similarly, the symbolic abstrac- 
tion will be a crucial tool in the analysis and synthesis 
of our humanoids; but we do not necessarily expect these 
symbols to appear explicitly in the humanoid's head. 

Thus, both of these pieces will inform our ap- 
proach to representation. However, it is not at all 
clear that a single "symbol" (in the conventional sense, 
e.g., 'dolphin') will have a unitary representation (e.g., 
in the human brain the image of a dolphin may be 
stored separately from categorical knowledge about dol-
phins as sea creatures). As a result, we will need to
broaden the conventional definitions. We expect to use 
lower level modules — derived, e.g., from more 'reactive' 
approaches — to come up with appropriate responses to 
stimuli. From these, we will identify patterns of behav- 
ior that represent generalizations — proto-symbols — and 
use these to establish reasoning that appears to be more
"symbolic."

There is an argument that certain components of 
stimulus-response systems are "symbolic." For example, 
if a particular neuron fires — or a particular wire carries 
a positive voltage — whenever something red is visible, 
that neuron — or wire — may be said to "represent" the 
presence of something red. While this argument may be 
perfectly reasonable as an observer's explanation of the 
system, it should not be mistaken for an explanation of 
what the agent in question believes. In particular, the 
positive voltage on the wire does not represent the pres- 
ence of red to the agent; the positive voltage is the pres- 
ence of something red as far as the robot is concerned. 

The digital abstraction is not a statement about how 
things are; it is merely a way of viewing them. A com- 
binational circuit may be analyzed in terms of boolean 
logic, but it is voltages, not a collection of ones and ze- 
ros. (Or, perhaps, it is electrons moving in a particular 
way.) At best, the digital abstraction tells us that the 
combinational circuit is amenable to analysis in terms of
ones and zeros; but it does not change the reality of what 
is there. 

Similarly, the utility of the symbolic abstraction in an- 
alyzing rational behavior does not indicate that there are 
actually entities corresponding to symbols in the brain. 
Rather, it indicates that the brain — or, more likely, 
portions of the brain (viz. the debounced switch) — are 
amenable to analysis in symbolic terms. It does not 
change the fact that everything in the brain is (sub- 
symbolic) neural activity; nor does the equation of brain 
function with neural activity rule out the utility of a 
symbolic explanation. 

In building a humanoid, we will begin at this sensory
level. All intelligence will be grounded in computation on 
sensory information or on information derived from sen- 
sation. However, some of this computation will abstract 
away from explicit sensation, generalizing, e.g., over sim- 
ilar situations or sensory inputs. Through sensation and 
action, the humanoid will experience a conceptualization 
of space: "up," "down," "near," "far," etc. We hypoth- 
esize that at this point it will be useful for observers to 
describe the behavior of the humanoid in symbolic terms. 
("It put the red blocks together.") This is the first step 
in representation. 

The next step involves a jump from the view of sym- 
bols as a convenient but post hoc explanation (i.e., for 
an observer) to a view in which symbols, somehow, ap- 
pear to the agent to exist inside the agent's head. This 
second step is facilitated by language, one of the tools 
that allows us to become observers of ourselves. This is 
the trick of consciousness: the idea that "we" exist, that 
one part of us is observing another. 

Although there is good evidence that consciousness is 
anything but a simple phenomenon (i.e., that the reality
is far more complex than our post hoc reconstruction of
it) (Springer & Deutsch 1981), it almost certainly does 
have some of the properties that we attribute to it. 

With language, symbols become more than merely a 
post hoc explanation by others of the workings of our 
own brains; symbols become our own explanation to our- 
selves. It is this ability to distance ourselves from our 
own symbols that gives rise to our illusions of conscious- 
ness (Bickhard n.d.). How can we produce these "sym- 
bolic" associations? The same processes that produce 
responses from sensory inputs can be stimulated inter- 
nally. For example, Kosslyn (1993) has demonstrated 
that portions of the visual cortex are implicated in visual 
imagery, suggesting precisely this sort of self-stimulation. 
Stein (to appear) takes a similar approach to add cogni- 
tive capacity to a behavior-based robot. 

We can summarize our approach to representation as 
follows: Stimulus-response systems abstract away from 
particular inputs to treat large classes of inputs simi- 
larly. This begins the "generalization" of particular stim- 
uli into complex reactions and the external appearance 
of categorization, or proto-symbols. Next, these abstrac- 
tions begin to be produced without resorting to actual 
sensory inputs. Symbol-like behavior results, but with- 
out instantiating symbols directly. 

4 High Performance Computing 

We are proposing a very different way to use high perfor- 
mance computation and communication, and proposing 
to use it in a domain which promises to become a major 
consumer of computation: intelligent embodied agents 
that interact with humans. 

While traditional parallel processors are designed to 
act like fast serial computers, we are addressing an in- 
herently parallel task. Indeed, while for most of com- 
puter science the translation to parallel hardware has 
imposed additional complexity (and, indeed, much cur- 
rent research is devoted to minimizing the overhead of 
this translation), we anticipate a significant simplifica- 
tion of our task in virtue of the parallel hardware avail- 
able. 

Much of the work on high performance computation 
is benchmarked in terms of how it speeds up numerical 
simulations of physical phenomena (Cypher, Ho, Kon- 
stantinidou & Messina 1993). In these domains there is 
a well defined set of computations that given a valid set 
of initial conditions are guaranteed to be well behaved in 
some sense, generating a sufficiently accurate simulation 
of how events will unfold over time. Data is collected 
along the way, and a final summary of how the modeled 
system evolved over time is the result of the computa- 
tion. The model of a computation is very much that of an 
algorithm that is given input data and, after some suit- 
able computation, outputs some data. As a result, much 
of the research into high performance parallel computers 
is concerned with how to present a shared memory that 
can be accessed quickly by all processors, leading to the 
need for local caching schemes and high speed switching 
networks; how to make sure that all such views of mem- 
ory are consistent, leading to the need for handling cache 
coherence; and how to dynamically balance the load on
all processors, given the implicit understanding that the
goal of the whole job is to complete the computation as 
quickly as possible. 

In our "problem" the constraints are very different. 
By the nature of the system we do not need to migrate 
processes, do not need a shared memory, and do not 
need to dynamically redirect messages. Simple "hard-
wired" message networks should suffice, with memory
only local to each processor. The goal is not to "finish" 
a computation as quickly as possible but instead to pass 
the data through a process in a bounded amount of time 
so that the next data that the world presents to the sys- 
tem can flow through without getting blocked or lost. 
There is no end to a computation or final result; all is 
continuously being computed and recomputed, and ac- 
tions in the world are the "outputs" of the system. But 
the computation is not simply linear in ordering. There 
must be many pathways between sensors and actuators, 
some with very different latencies, each one contributing 
to some aspect of the resulting behavior of the system. 
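
One way to picture the constraints just described is the following Python sketch. It is an illustration under assumptions, not the machine proposed here: the classes Node and Network, the node functions (range_sensor, avoid, smooth, cruise, motor), the port names, and all numeric constants are hypothetical. Each node has only local state, wires are fixed when the network is built, each input port is a one-slot register in which a newer value overwrites an older one so that nothing blocks, and every node does a small fixed amount of work per tick. Two pathways of different length lead from the same sensor to the same actuator.

```python
from typing import Any, Callable, Dict, List, Tuple

Step = Callable[[Dict[str, Any], Dict[str, Any]], Dict[str, Any]]


class Node:
    """A small process with local memory and one-slot input registers."""

    def __init__(self, step: Step) -> None:
        self.step = step
        self.state: Dict[str, Any] = {}   # memory local to this node
        self.inbox: Dict[str, Any] = {}   # latest value seen on each input port


class Network:
    """Fixed-topology network: wires are set up once and never rerouted."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        self.wires: List[Tuple[str, str, str, str]] = []

    def add(self, name: str, step: Step) -> None:
        self.nodes[name] = Node(step)

    def wire(self, src: str, out_port: str, dst: str, in_port: str) -> None:
        self.wires.append((src, out_port, dst, in_port))

    def inject(self, node: str, port: str, value: Any) -> None:
        """The world writes a fresh sensor value, overwriting the old one."""
        self.nodes[node].inbox[port] = value

    def tick(self) -> None:
        """Every node runs once on last tick's registers, then wires deliver."""
        outputs = {name: n.step(n.state, dict(n.inbox))
                   for name, n in self.nodes.items()}
        for src, out_port, dst, in_port in self.wires:
            if out_port in outputs[src]:
                self.nodes[dst].inbox[in_port] = outputs[src][out_port]


def range_sensor(state, inputs):
    return {"out": inputs["raw"]} if "raw" in inputs else {}


def avoid(state, inputs):  # short pathway: halt when something is close
    return {"halt": inputs["range"] < 0.5} if "range" in inputs else {}


def smooth(state, inputs):  # long pathway, stage 1: running average
    if "range" not in inputs:
        return {}
    state["avg"] = 0.5 * state.get("avg", inputs["range"]) + 0.5 * inputs["range"]
    return {"avg": state["avg"]}


def cruise(state, inputs):  # long pathway, stage 2: speed from free space
    return {"speed": min(1.0, inputs["avg"] / 4.0)} if "avg" in inputs else {}


def motor(state, inputs):  # the action in the world is the only "output"
    state["command"] = 0.0 if inputs.get("halt", False) else inputs.get("speed", 0.0)
    return {}


def build_robot() -> Network:
    net = Network()
    for name, step in [("range", range_sensor), ("avoid", avoid),
                       ("smooth", smooth), ("cruise", cruise), ("motor", motor)]:
        net.add(name, step)
    net.wire("range", "out", "avoid", "range")
    net.wire("range", "out", "smooth", "range")
    net.wire("avoid", "halt", "motor", "halt")
    net.wire("smooth", "avg", "cruise", "avg")
    net.wire("cruise", "speed", "motor", "speed")
    return net
```

In this sketch there is no final result to wait for: calling tick() repeatedly recomputes every node from whatever its registers currently hold. If a range of 4.0 is injected and the network is ticked four times, the motor command reaches 1.0 through the longer pathway; if 0.2 is then injected, the command stays at 1.0 for two ticks and drops to 0.0 on the third, when the shorter pathway's halt signal arrives, one tick ahead of any change along the longer pathway.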

We need high performance and parallel computing in 
order to guarantee the bounds on computation time of 
any particular step in the processes. We will push on the 
organization of computation to do useful tasks directly in 
the real world, and will be pushing in a direction which 
should lead to inherently simpler-to-construct massively 
parallel computers. The applications of this sort of pro- 
cessing will be wide ranging and indeed may well become 
pervasive throughout our society. 

Our problem is one of maintaining activity rather than
achieving a single solution to a problem.

We need parallelism because of the vast amounts of 
processing that must be done in order to make sense
of a continuous and rich stream of perceptual data. We 
need parallelism to coordinate the many actuation sys- 
tems that need to work in synchrony (e.g., the ocular 
system and the neck must move in a coordinated fashion 
at times to maintain image stability) and which need to
be servoed at high rates. We need parallelism in order 
to have a continuously operating system that can be up- 
graded without having to recompile, reload, and restart 
all of the software that runs the stable lower level aspects 
of the humanoid. And finally we need parallelism for the 
cognitive aspects of the system as we are attempting to 
build a system with more capability than can fit on any 
existing single processor. 

But, in real-time embedded systems there is another 
necessary reason for parallelism. It is the fact that there 
are many things to be attended to happening in the 
world continuously, independently of the agent. From 
this comes the notion of an agent being situated in the 
world. Not only must the agent devote attention to per- 
haps hundreds of different sensors many times per sec- 
ond, but it must also devote attention "down stream" in 
the processing chain in many different places at many 
times per second as the processed sensor data flows 
through the system. The actual amounts of computation 
needed to be done by each of these individual processes 
is in fact quite small, so small that originally we formalized
them as augmented finite state machines (Brooks
1986), although more recently we have thought of them
as real-time rules (Brooks 1990a). They are too small 
to have a complete processor devoted to them in any 
machine beyond a CM-2, and even there the processors 
would be mostly idle. A better approach is to simulate 
parallelism in a single conventional processor with its 
own local memory. 

Our humanoid robot will be situated in a real world 
over which it has very little control. There will be people 
present, moving about, changing the physical environs of 
the humanoid, responding to actions of the humanoid, 
and generating spontaneous behaviors themselves. The 
task for the humanoid will be to interact with these ul- 
timately unpredictable agents in a coherent way. It will 
get a continuous large and rich stream of input data of 
which it must make sense, relating it to past experiences 
and future possibilities in the world. It will be a partic- 
ipant in this world and must act with appropriate speed 
and grace. 

5 Hardware and Software Experimental 
Platforms 

We have extensive experience in building mobile robots. 
The PIs have been directly involved in the design
and construction of over 35 different designs for mobile 
robots, and with multiple instances of many of these 
types of robots — over 100 robots in total. 

In that previous work with mobile robots, we started 
out thinking we would build one mobile robot that would 
be a platform for research for a generation of gradu- 
ate students (Brooks 1986). That soon changed as we 
realized three things: (1) trying to design everything 
into one robot caused too many compromises in our re- 
search goals as early experiments soon pointed to mul- 
tiple different sensor/actuator suites which needed to 
be explored, (2) graduate students working on some- 
what separate thesis projects needed their own robots if 
they were to do extensive multi-hundred-hour operation
experiments, rather than the simple validation demonstrations
in controlled environments that were often conducted
in many research projects (Brooks 1991b), and (3)
by continually re-engineering our designs we gradually
built more robust robots with longer mean times between
catastrophic failures.[6] Building many robots over a short
period of time led to rapid increases in performance over
a diverse set of robot morphologies (Brooks (1986), Connell
(1987), Horswill & Brooks (1988), Brooks (1989),
Connell (1990), Angle & Brooks (1990), Mataric (1992b),
Mataric (1992a), Ferrell (1993), Horswill (1993); see
Brooks (1990b) for an overview). At the same time,
a common software system (Brooks 1990a) was developed
which ran on many different processors, but provided
a common environment for programming all the
diverse robots. Brooks (1990b) gives a mid-course review
of some of those robots.

[6] This observation parallels the developments in digital
computers, where mean time between failures in the 1950's
was in the 20 minute range, extending to periods of a week
in the 1970's, and now typically we are not surprised when
our workstations run for months without needing to be
rebooted; this increase in robustness was bought with many
hundreds of iterations of the engineering cycle.

In this project too, we expect that there will be great 
benefits from building the humanoid repeatedly over the 
life of the project and from running the software on mul- 
tiple computer architectures, taking advantage in both 
cases of technological developments that will occur inde- 
pendently of this project. At the same time we will be 
following a learning curve, increasing our engineering so- 
phistication and the inherent robustness of the systems 
we build. 

To this end we have already started building the zero- 
th version of the humanoid over the summer of 1993, 
relying on current supplies in stock and largely off the 
shelf components which are being purchased with mod- 
est amounts of unrestricted funds from previous dona- 
tions. At the same time a more extensive software de- 
velopment effort is under way. We expect the zero-th 
generation hardware to disappear within a few months, 
but the software will form the kernel of future systems. 

5.1 Brains 

Our goal is to take advantage of the new availability of 
massively parallel computation in dedicated machines. 
We need parallelism because of the vast amounts of pro- 
cessing that must be done in order to make sense of a 
continuous and rich stream of perceptual data. We need 
parallelism to coordinate the many actuation systems 
that need to work in synchrony (e.g., the ocular system
and the neck must move in a coordinated fashion at times
to maintain image stability) and which need to be ser- 
voed at high rates. We need parallelism in order to have 
a continuously operating system that can be upgraded 
without having to recompile, reload, and restart all of 
the software that runs the stable lower level aspects of 
the humanoid. And finally we need parallelism for the 
cognitive aspects of the system as we are attempting to 
build a "brain" with more capability than can fit on any 
existing single processor. 

But in real-time embedded systems there is yet an- 
other necessary reason for parallelism. It is the fact 
that there are many things to be attended to, happen- 
ing in the world continuously, independent of the agent. 
From this comes the notion of an agent being situated 
in the world. Not only must the agent devote atten- 
tion to perhaps hundreds of different sensors many times 
per second, but it must also devote attention "down 
stream" in the processing chain in many different places 
at many times per second as the processed sensor data 
flows through the system. The actual amounts of com- 
putation needed to be done by each of these individual 
processes is in fact quite small, so small that originally 
we formalized them as augmented finite state machines 
(Brooks 1986), although more recently we have thought 
of them as real-time rules (Brooks 1990a). They are too 
small to have a complete processor devoted to them in 
any machine beyond a CM-2, and even there the pro- 
cessors would be mostly idle. A better approach is to 
simulate parallelism in a single conventional processor 
with its own local memory. 

For instance, Ferrell (1993) built a software system
to control a 19 actuator six legged robot using about 60
of its sensors. She implemented it as more than 1500
parallel processes running on a single Philips 68070. (It
communicated with 7 peripheral processors which han- 
dled sensor data collection and 100Hz motor servoing.) 
Most of these parallel processes ran at rates varying be- 
tween 10 and 25 Hertz. Each time each process ran, it 
took at most a few dozen instructions before blocking, 
waiting either for the passage of time or for some other 
process to send it a message. Clearly, low cost context 
switching was important. 
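The run-until-blocked style described above can be illustrated with a small cooperative scheduler. The following Python sketch is not Ferrell's implementation; the names `Scheduler`, `sensor`, `listener`, the mailbox name "leg", and the simulated millisecond clock are all assumptions made for illustration. Each process is a generator that does a little work and then blocks, either on the passage of time or on a message from another process, so a context switch costs no more than resuming a generator.

```python
import heapq
from collections import defaultdict


class Scheduler:
    """Cooperative scheduler: each process runs until it blocks."""

    def __init__(self):
        self.now = 0                      # simulated time in milliseconds
        self.timers = []                  # heap of (wake_time, seq, process)
        self.waiting = defaultdict(list)  # mailbox name -> blocked processes
        self.ready = []                   # (process, value to resume with)
        self.seq = 0                      # tie-breaker so the heap never compares processes

    def spawn(self, proc):
        self.ready.append((proc, None))

    def send(self, mailbox, value):
        # Wake every process blocked on this mailbox; no queue is kept,
        # so a message sent while nobody is waiting is simply dropped.
        for proc in self.waiting.pop(mailbox, []):
            self.ready.append((proc, value))

    def _step(self, proc, value):
        try:
            kind, arg = proc.send(value)  # run until the next blocking point
        except StopIteration:
            return
        if kind == "sleep":
            self.seq += 1
            heapq.heappush(self.timers, (self.now + arg, self.seq, proc))
        elif kind == "wait":
            self.waiting[arg].append(proc)

    def run(self, until):
        while True:
            while self.ready:
                proc, value = self.ready.pop(0)
                self._step(proc, value)
            if not self.timers or self.timers[0][0] > until:
                break
            wake, _, proc = heapq.heappop(self.timers)
            self.now = wake
            self.ready.append((proc, None))


def sensor(sched, period_ms):
    """Produces a reading, then blocks on the passage of time."""
    reading = 0
    while True:
        reading += 1
        sched.send("leg", reading)
        yield ("sleep", period_ms)


def listener(log):
    """Blocks until some other process sends it a message."""
    while True:
        value = yield ("wait", "leg")
        log.append(value)


log = []
sched = Scheduler()
sched.spawn(listener(log))
sched.spawn(sensor(sched, 100))   # a 10 Hz process
sched.run(until=450)
print(log)
```

Because no process is ever interrupted, nothing beyond the generator's own frame has to be saved at a switch, which is the property the text identifies as important.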

The underlying computational model used on that 
robot — and with many tens of other autonomous mobile 
robots we have built — consisted of networks of message- 
passing augmented finite state machines. Each of these 
AFSMs was a separate process. The messages were sent 
over predefined 'wires' from a specific transmitting to 
a specific receiving AFSM. The messages were simple 
numbers (typically 8 bits) whose meaning depended on 
the designs of both the transmitter and the receiver. An 
AFSM had additional registers which held the most re- 
cent incoming message on any particular wire. This gives 
a very simple model of parallelism, even simpler than 
that of CSP (Hoare 1985). The registers could have their 
values fed into a local combinatorial circuit to produce 
new values for registers or to provide an output mes- 
sage. The network of AFSMs was totally asynchronous, 
but individual AFSMs could have fixed duration monos- 
tables which provided for dealing with the flow of time 
in the outside world. The behavioral competence of the 
system was improved by adding more behavior-specific 
network to the existing network. This process was called 
layering. This was a simplistic and crude analogy to 
evolutionary development. As with evolution, at every 
stage of the development the systems were tested. Each 
of the layers was a behavior-producing piece of network 
in its own right, although it might implicitly rely on the 
presence of earlier pieces of network. For instance, an 
explore layer did not need to explicitly avoid obstacles, 
as the designer knew that a previous avoid layer would 
take care of it. A fixed priority arbitration scheme was 
used to handle conflicts. 
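A toy version of this model may make the pieces concrete. The Python sketch below is an illustration under stated assumptions, not the authors' system: the class names `Wire` and `AFSM`, the command codes, the 30-unit sonar threshold, and the two-tick hold are hypothetical, and the machines are stepped in a fixed order for simplicity whereas the real networks were asynchronous. It shows registers that hold the most recent message on each wire, 8-bit messages, and an `avoid` layer taking priority over an `explore` layer through a fixed priority arbiter with a monostable.

```python
class Wire:
    """Predefined connection from a transmitting AFSM to one register."""

    def __init__(self, receiver, register):
        self.receiver, self.register = receiver, register

    def send(self, value):
        # Messages are small numbers; keep 8 bits and overwrite the old value.
        self.receiver.registers[self.register] = value & 0xFF


class AFSM:
    """Registers plus a combinational function producing an optional output."""

    def __init__(self, name, fn, registers):
        self.name = name
        self.fn = fn                      # registers -> int or None
        self.registers = {r: None for r in registers}
        self.outputs = []                 # wires driven by this machine

    def step(self):
        out = self.fn(self.registers)
        if out is not None:
            for wire in self.outputs:
                wire.send(out)


def make_arbiter(hold_ticks):
    """Fixed priority: a 'high' message overrides 'low' for hold_ticks steps."""
    state = {"timer": 0, "held": None}

    def fn(regs):
        if regs["high"] is not None:
            state["timer"], state["held"] = hold_ticks, regs["high"]
            regs["high"] = None           # consume the message
        if state["timer"] > 0:            # monostable still active
            state["timer"] -= 1
            return state["held"]
        return regs["low"]

    return fn


FORWARD, TURN = 1, 2                      # hypothetical motor commands
motor_log = []


def record(regs):
    motor_log.append(regs["cmd"])
    return None


motor = AFSM("motor", record, ["cmd"])
arbiter = AFSM("arbiter", make_arbiter(2), ["high", "low"])
arbiter.outputs.append(Wire(motor, "cmd"))

# Earlier layer: avoid obstacles. Later layer: explore, unaware of obstacles.
avoid = AFSM(
    "avoid",
    lambda r: TURN if r["sonar"] is not None and r["sonar"] < 30 else None,
    ["sonar"],
)
avoid.outputs.append(Wire(arbiter, "high"))
explore = AFSM("explore", lambda r: FORWARD, [])
explore.outputs.append(Wire(arbiter, "low"))

sonar = Wire(avoid, "sonar")
for distance in [200, 200, 20, 200, 200, 200]:
    sonar.send(distance)
    for machine in (avoid, explore, arbiter, motor):
        machine.step()

print(motor_log)
```

Note that `explore` contains no obstacle logic at all; it relies implicitly on the presence of `avoid`, as in the layering example in the text.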

On top of the AFSM substrate we used another 
abstraction known as the Behavior Language, or BL 
(Brooks 1990a), which was much easier for the user 
to program with. The output of the BL compiler was 
a standard set of augmented finite state machines; by 
maintaining this compatibility all existing software could 
be retained. When programming in BL the user has com- 
plete access to full Common Lisp as a meta-language by 
way of a macro mechanism. Thus the user could eas- 
ily develop abstractions on top of BL, while still writ- 
ing programs which compiled down to networks of AF- 
SMs. In a sense, AFSMs played the role of assembly 
language in normal high level computer languages. But 
the structure of the AFSM networks enforced a program- 
ming style which naturally compiled into very efficient 
small processes. The structure of the Behavior Language 
enforced a modularity where data sharing was restricted 
to smallish sets of AFSMs, and whose only interfaces 
were essentially asynchronous 1-deep buffers. 



In the humanoid project we believe much of the com- 
putation, especially for the lower levels of the system, 
will naturally be of a similar nature. We expect to 
perform different experiments where in some cases the 
higher level computations are of the same nature and in 
other cases the higher levels will be much more symbolic 
in nature, although the symbolic bindings will be re- 
stricted to within individual processors. We need to use 
software and hardware environments which give support 
to these requirements without sacrificing the high levels 
of performance of which we wish to make use. 

5.1.1 Software 

For the software environment we have a number of 
requirements: 

• There should be a good software development en- 
vironment. 

• The system should be completely portable over 
many hardware environments, so that we can up- 
grade to new parallel machines over the lifetime of 
this project. 

• The system should provide efficient code for per- 
ceptual processing such as vision. 

• The system should let us write high level symbolic 
programs when desired. 

• The system language should be a standardized lan- 
guage that is widely known and understood. 

In summary our software environment should let us gain 
easy access to high performance parallel computation. 

We have chosen to use Common Lisp (Steele Jr. 1990) 
as the substrate for all software development. This 
gives us good programming environments including type 
checked debugging, rapid prototyping, symbolic compu- 
tation, easy ways of writing embedded language abstrac- 
tions, and automatic storage management. We believe 
that Common Lisp is superior to C (the other major 
contender) in all of these aspects. 

The problem then is how to use Lisp in a massively 
parallel machine where each node may not have the vast 
amounts of memory that we have become accustomed 
to feeding Common Lisp implementations on standard 
Unix boxes. 

We have a long history of building high performance 
Lisp compilers (Brooks, Gabriel & Steele Jr. 1982), including
one of the two most common commercial Lisp
compilers on the market, Lucid Lisp (Brooks, Posner,
McDonald, White, Benson & Gabriel 1986).

Recently we have developed L (Brooks 1993), a re- 
targetable small efficient Lisp which is a downwardly 
compatible subset of Common Lisp. When compiled 
for a 68000 based machine the load image (without the 
compiler) is only 140K bytes, but includes multiple val- 
ues, strings, characters, arrays, a simplified but com- 
patible package system, all the "ordinary" aspects of 
format, back quote and comma, setf etc., full Common 
Lisp lambda lists including optionals and keyword arguments,
macros, an inspector, a debugger, defstruct (integrated
with the inspector), block, catch, and throw,
etc., full dynamic closures, a full lexical interpreter, floating
point, fast garbage collection, and so on. The compiler
runs in time linear in the size of an input expression,
except in the presence of lexical closures. It neverthe- 
less produces highly optimized code in most cases. L is 
missing flet and labels, generic arithmetic, bignums, 
rationals, complex numbers, the library of sequence func- 
tions (which can be written within L) and esoteric parts 
of format and packages. 

The L system is an intellectual descendant of the dynamically
retargetable Lucid Lisp compiler (Brooks et al.
1986) and the dynamically retargetable Behavior Language
compiler (Brooks 1990a). The system is written entirely
in L with machine dependent backends for retargeting.
The first backend is for the Motorola 68020
(and upwards) family, but it is easily retargeted to new 
architectures. The process consists of writing a simple 
machine description, providing code templates for about 
100 primitive procedures (e.g., fixed precision integer +, 
*, =, etc., string indexing CHAR and other accessors, CAR, 
CDR, etc.), code macro expansion for about 20 pseudo 
instructions (e.g., procedure call, procedure exit, checking
the correct number of arguments, linking CATCH frames,
etc.) and two corresponding sets of assembler routines 
which are too big to be expanded as code templates ev- 
ery time, but are so critical in speed that they need to be 
written in machine language, without the overhead of a 
procedure call, rather than in Lisp (e.g., CONS, spreading
of multiple values on the stack, etc.). There is a version 
of the I/O system which operates by calling C routines 
(e.g., fgetchar, etc.; this is how the Macintosh version 
of L runs) so it is rather simple to port the system to any 
hardware platform we might choose to use in the future. 

Note carefully the intention here: L is to be the de- 
livery vehicle running on the brain hardware of the hu- 
manoid, potentially on hundreds or thousands of small 
processors. Since it is fully downward compatible with 
Common Lisp however, we can carry out code develop- 
ment and debugging on standard work stations with full 
programming environments (e.g., in Macintosh Common 
Lisp, or Lucid Common Lisp with Emacs 19 on a Unix 
box, or in the Harlequin programming environment on a 
Unix box). We can then dynamically link code into the 
running system on our parallel processors. 

There are two remaining problems: (1) how to main- 
tain super critical real-time performance when using a 
Lisp system without hard ephemeral garbage collection, 
and (2) how to get the level of within-processor paral- 
lelism described earlier. 

The structure of L's implementation is such that mul- 
tiple independent heaps can be maintained within a sin- 
gle address space, sharing all the code and data segments 
of the Lisp proper. In this way super-critical portions of 
a system can be placed in a heap where no consing is oc- 
curring, and hence there is no possibility that they will 
be blocked by garbage collection. 

The Behavior Language (Brooks 1990 a) is an exam- 
ple of a compiler which builds special purpose static 
schedulers for low overhead parallelism. Each process 
ran until blocked and the syntax of the language forced 
there to always be a blocking condition, so there was no
need for pre-emptive scheduling. Additionally the syntax
and semantics of the language guaranteed that there
would be zero stack context needed to be saved when a 
blocking condition was reached. We will need to build 
a new scheduling system with L to address similar is- 
sues in this project. To fit in with the philosophy of the 
rest of the system it must be a dynamic scheduler so 
that new processes can be added and deleted as a user 
types to the Lisp listener of a particular processor. Rea- 
sonably straightforward data structures can keep these 
costs to manageable levels. It is rather straightforward 
to build a phase into the L compiler which can recognize 
the situations described above. Thus it is straightfor- 
ward to implement a set of macros which will provide a 
language abstraction on top of Lisp which will provide 
all the functionality of the Behavior Language and which 
will additionally let us have dynamic scheduling. Almost 
certainly a pre-emptive scheduler will be needed in ad- 
dition, as it would be difficult to enforce a computation 
time limit syntactically when Common Lisp will essen- 
tially be available to the programmer — at the very least 
the case of the pre-emptive scheduler having to strike 
down a process will be useful as a safety device, and 
will also act as a debugging tool for the user to iden- 
tify time critical computations which are stressing the 
bounded computation style of writing. In other cases 
static analysis will be able to determine maximum stack 
requirements for a particular process, and so heap allocated
stacks will be usable.[7]

The software system so far described will be used to 
implement crude forms of 'brain models', where compu- 
tations will be organized in ways inspired by the sorts of 
anatomical divisions we see occurring in animal brains. 
Note that we are not saying we will build a model of a 
particular brain, but rather there will be a modularity in- 
spired by such components as visual cortex, auditory cor- 
tex, etc., and within and across those components there 
will be further modularity, e.g., a particular subsystem 
to implement the vestibulo-ocular response (VOR). 

Thus besides on-processor parallelism we will need to 
provide a modularity tool that packages processes into 
groups and limits data sharing between them. Each 
package will reside on a single processor, but often 
processors will host many such packages. A package 
that communicates with another package should be in- 
sulated at the syntax level from knowing whether the 
other package is on the same or a different proces- 
sor. The communication medium between such packages 
will again be 1-deep buffers without queuing or receipt 
acknowledgment — any such acknowledgment will need to 
be implemented as a backward channel, much as we see 
throughout the cortex (Churchland & Sejnowski 1992). 
This packaging system can be implemented in Common 
Lisp as a macro package. 

We expect all such system level software development 
to be completed in the first twelve months of the project. 



[7] The problem with heap allocated stacks in the general
case is that there will be no overflow protection into the rest
of the heap.



5.1.2 Computational Hardware 

The computational model presented in the previ- 
ous section is somewhat different from that usually as- 
sumed in high performance parallel computer applica- 
tions. Typically (Cypher et al. 1993) there is a strong 
bias on system requirements from the sort of benchmarks 
that are used to evaluate performance. The standard 
benchmarks for modern high performance computation 
seem to be Fortran codes for hydrodynamics, molecular
simulations, or graphics rendering. We are proposing
a very different application with very different requirements;
in particular we require real-time response
to a wide variety of external and internal events, we require
good symbolic computation performance, we require
only integer rather than high performance floating
point operations,[8] we require delivery of messages
only to specific sites determined at program design time, 
rather than at run-time, and we require the ability to do 
very fast context switches because of the large number 
of parallel processes that we intend to run on each indi- 
vidual processor. 

The fact that we will not need to support pointer ref- 
erences across the computational substrate will mean 
that we can rely on much simpler, and therefore 
higher performance, parallel computers than many other 
researchers — we will not have to worry about a consis- 
tent global memory, cache coherence, or arbitrary mes- 
sage routing. Since these are different requirements than 
those that are normally considered, we have to make 
some measurements with actual programs before we can
make an intelligent off the shelf choice of computer
hardware.

In order to answer some of these questions we are cur- 
rently building a zero-th generation parallel computer. It 
is being built on a very low budget with off the shelf com- 
ponents wherever possible (a few fairly simple printed 
circuit boards need to be fabricated). The processors 
are 16MHz Motorola 68332s on a standard board built
by Vesta Technology. These plug 16 to a backplane. 
The backplane provides each processor with six commu- 
nications ports (using the integrated timing processor 
unit to generate the required signals along with spe- 
cial chip select and standard address and data lines) 
and a peripheral processor port. The communications 
ports will be hand-wired with patch cables, building a 
fixed topology network. (The cables incorporate a single 
dual ported RAM (8K by 16 bits) that itself includes 
hardware semaphores writable and readable by the two 
processors being connected.) Background processes run- 
ning on the 68332 operating system provide sustained 
rate transfers of 60Hz packets of 4K bytes on each port, 
with higher peak rates if desired. These sustained rates 
do consume processing cycles from the 68332. On non-vision
processors we expect much lower rates will be
needed, and even on vision processors we can probably
reduce the packet frequency to around 15Hz. Each
processor has an operating system, L, and the dynamic
scheduler residing in 1M of EPROM. There is 1M of
RAM for program, stack and heap space. Up to 256
processors can be connected together.

[8] Consider the dynamic range possible in single signal
channels in the human brain and it soon becomes apparent
that all that we wish to do is certainly achievable with neither
a span of 600 orders of magnitude nor 47 significant binary
digits.

Up to 16 backplanes can be connected to a single front 
end processor (FEP) via a shared 500K baud serial line 
to a SCSI emulator. A large network of 68332s can span 
many FEPs if we choose to extend the construction of 
this zero-th prototype. Initially we will use a Macintosh 
as a FEP. Software written in Macintosh Common Lisp 
on the FEP will provide disk I/O services to the 68332's, 
monitor status and health packets from them, and pro- 
vide the user with a Lisp listener to any processor they 
might choose. 
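The figures quoted for the zero-th generation computer can be combined into a rough communication budget. The short Python calculation below uses only numbers from the text, plus two labeled assumptions: that "4K bytes" means 4096 bytes, and that the 500K baud serial line carries 10 bits per byte (8 data bits with start and stop bits). It is a back-of-the-envelope check, not a measurement.

```python
PACKET_BYTES = 4 * 1024          # "4K bytes" read as 4096 bytes (assumption)
PORTS_PER_PROCESSOR = 6
PROCESSORS_PER_BACKPLANE = 16
BACKPLANES_PER_FEP = 16
FEP_BAUD = 500_000
BITS_PER_SERIAL_BYTE = 10        # 8N1 framing (assumption)


def port_rate(packet_hz):
    """Sustained bytes per second on one communications port."""
    return packet_hz * PACKET_BYTES


sustained = port_rate(60)                        # rate quoted for each port
reduced = port_rate(15)                          # reduced rate for vision processors
all_ports = sustained * PORTS_PER_PROCESSOR      # one processor, all six ports busy
processors = PROCESSORS_PER_BACKPLANE * BACKPLANES_PER_FEP
fep_bytes_per_s = FEP_BAUD // BITS_PER_SERIAL_BYTE
fep_share = fep_bytes_per_s / processors         # if every processor talks at once

print(f"per port at 60Hz:      {sustained} bytes/s")
print(f"per port at 15Hz:      {reduced} bytes/s")
print(f"six ports at 60Hz:     {all_ports} bytes/s")
print(f"processors per FEP:    {processors}")
print(f"FEP line share, each:  {fep_share} bytes/s")
```

Under these assumptions the shared serial line to the front end is several orders of magnitude slower per processor than a single inter-processor port, which is consistent with its stated role of disk I/O, status packets, and a Lisp listener rather than sensor data.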

The zero-th version uses the standard Motorola SPI 
(serial peripheral interface) to communicate with up to 
16 Motorola 6811 processors per 68332. Each is a single
chip processor with onboard EEPROM (2K bytes)
and RAM (256 bytes), including a timer system, an SPI 
interface, and 8 channels of analog to digital conversion. 
We are building a small custom board for this proces- 
sor that includes opto-isolated motor drivers and some 
standard analog support for sensors.[9]

We expect our first backplane to be operational by 
August 1st, 1993 so that we can commence experiments 
with our first prototype body. We will collect statistics 
on inter-processor communication throughput, effects of 
latency, and other measures so that we can better choose 
a larger scale parallel processor for more serious versions 
of the humanoid. 

In the meantime, however, there are certain develop- 
ments on the horizon within the MIT Artificial Intel- 
ligence Lab which we expect to capitalize upon in or- 
der to dramatically upgrade our computational systems 
for early vision, and hence the resolution at which we 
can afford to process images in real time. The first of 
these, expected in the fall, will be a somewhat similar
distributed processing system based on the much higher
performance Texas Instruments C40, which comes with
built in support for fixed topology message passing. We 
expect these systems to be available in the Fall '93 time- 
frame. In October '94 we expect to be able to make use 
of the Abacus system, a bit level reconfigurable vision 
front-end processor being built under ARPA sponsorship 
which promises Tera-op performance on 16 bit fixed pre- 
cision operands. Both these systems will be simply inte- 
grable with our zero-th order parallel processor via the 
standard dual-ported RAM protocol that we are using. 

5.2 Bodies 

As with the computational hardware, we are also cur- 
rently engaged in building a zero-th generation body 
for early experimentation and design refinement towards 
more serious constructions within the scope of this proposal.
We are presently limited by budgetary constraints
to building an immobile, armless, deaf torso with only
black and white vision.

[9] We currently have 28 operational robots in our labs each
with between 3 and 5 of these 6811 processors, and several
dozen other robots with at least 1 such processor on board.
We have great experience in writing compiler backends for
these processors (including BL) and great experience in using
them for all sorts of servoing, sensor monitoring, and
communications tasks.

In the following subsections we outline the constraints 
and requirements on a full scale humanoid body and also 
include, where relevant, details of our zero-th level prototype.

5.2.1 Eyes 

There has been quite a lot of recent work on animate 
vision using saccading stereo cameras, most notably at 
Rochester (Ballard 1989, Coombs 1992), but also more
recently at many other institutions, such as Oxford Uni- 
versity. 

The humanoid needs a head with high mechanical 
performance eyeballs and foveated vision if it is to be 
able to participate in the world with people in a natu- 
ral way. Even our earliest heads will include two eyes, 
with foveated vision, able to pan and tilt as a unit, and 
with independent saccading ability (three saccades per 
second) and vergence control of the eyes. Fundamental
vision based behaviors will include a visually calibrated
vestibulo-ocular reflex, smooth pursuit, visually
calibrated saccades, and object centered foveal relative 
depth stereo. Independent visual systems will provide 
peripheral and foveal motion cues, color discrimination, 
human face pop-outs, and eventually face recognition. 
Over the course of the project, object recognition
based on "representations" from body schemas and manipulation
interactions will be developed. This is completely
different from any conventional object recognition
scheme, and cannot be attempted without an integrated
vision and manipulation environment such as we
propose.

The eyeballs need to be able to saccade up to about 
three times per second, stabilizing for 250ms at each 
stop. Additionally the yaw axes should be controllable 
for vergence to a common point and drivable in a man- 
ner appropriate for smooth pursuit and for image stabi- 
lization as part of a vestibulo-ocular response (VOR) to 
head movement. The eyeballs do not need to be force 
or torque controlled but they do need good fast position 
and velocity control. We have previously built a sin- 
gle eyeball, A-eye, on which we implemented a model of 
VOR, optokinetic response (OKR) and saccades, all
of which used dynamic visually based calibration (Viola 
1990). 
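The saccade requirement above implies a tight time budget for the motion itself. The following Python arithmetic uses the two figures from the text (three saccades per second, 250ms of stabilization per stop); the 20 degree amplitude is purely a hypothetical example value, chosen only to show what mean angular speed such a budget would demand.

```python
SACCADES_PER_SECOND = 3
DWELL_MS = 250                                   # stabilization time at each stop

period_ms = 1000 / SACCADES_PER_SECOND           # time available per saccade cycle
travel_ms = period_ms - DWELL_MS                 # time left for the movement itself


def mean_speed_deg_per_s(amplitude_deg):
    """Average angular speed needed to cover amplitude_deg in travel_ms."""
    return amplitude_deg / (travel_ms / 1000)


example_speed = mean_speed_deg_per_s(20)         # 20 degrees is a hypothetical amplitude

print(f"cycle: {period_ms:.1f} ms, travel: {travel_ms:.1f} ms")
print(f"mean speed for a 20 degree saccade: {example_speed:.0f} deg/s")
```

With roughly 83ms left for each movement, the actuators must accelerate, travel, and settle well inside a tenth of a second, which is why good fast position and velocity control matters more here than force or torque control.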

Other active vision systems have had both eyeballs 
mounted on a single tilt axis. We will begin experiments 
with separate tilt axes but if we find that relative tilt 
motion is not very useful we will back off from this re- 
quirement in later versions of the head. 

The cameras need to cover a wide field of view, prefer- 
ably close to 180 degrees, while also giving a foveated 
central region. Ideally the images should be RGB (rather 
than the very poor color signal of standard NTSC). A 
resolution of 512 by 512 at both the coarse and fine scale 
is desirable. 

The cameras in our zero-th version are black and white
only. Each eyeball consists of two small lightweight cameras
mounted with parallel axes. One gives a 115 degree
field of view and the other gives a 20 degree foveated
region. In order to handle the images in real time in our 
zero-th parallel processor we will subsample the images 
to be much smaller than the ideal. 
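The text does not say how far the images will be subsampled, so the following Python sketch simply assumes a factor of 8 in each direction and block averaging as the method; both are illustrative choices, as is the synthetic test image. It reduces an ideal 512 by 512 image to 64 by 64 and reports the resulting frame size at one byte per pixel.

```python
def subsample(image, factor):
    """Block-average a square grayscale image given as a list of rows."""
    size = len(image) // factor
    out = []
    for by in range(size):
        row = []
        for bx in range(size):
            total = 0
            for y in range(by * factor, (by + 1) * factor):
                for x in range(bx * factor, (bx + 1) * factor):
                    total += image[y][x]
            row.append(total // (factor * factor))   # integer mean of the block
        out.append(row)
    return out


# Synthetic 512 x 512 image with 8-bit pixel values.
ideal = [[(x + y) % 256 for x in range(512)] for y in range(512)]

small = subsample(ideal, 8)                          # assumed factor: 512 -> 64
bytes_per_frame = len(small) * len(small[0])         # one byte per pixel

print(len(small), len(small[0]), bytes_per_frame)
```

Under these assumptions a subsampled frame occupies 4096 bytes, the same order as the 4K byte packets mentioned for the communications ports, but the actual image sizes used remain to be determined by experiment.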

Later versions of the head will have full RGB color 
cameras, wider angles for the peripheral vision, much 
finer grain sampling of the images, and perhaps a col- 
inear optics set up using optical fiber cables and beam 
splitters. With more sophisticated high speed process- 
ing available we will also be able to do experiments with 
log-polar image representations. 

5.2.2 Ears, Voice 

Almost no work has been done on sound understand- 
ing, as distinct from speech understanding. This project 
will start on sound understanding to provide a much 
more solid processing base for later work on speech in- 
put. Early behavior layers will spatially correlate noises 
with visual events, and spatial registration will be con- 
tinuously self calibrating. Efforts will concentrate on us- 
ing this physical cross-correlation as a basis for reliably 
pulling out interesting events from background noise, 
and mimicking the cocktail party effect of being able 
to focus attention on particular sound sources. Visual 
correlation with face pop-outs, etc., will then be used 
to be able to extract human sound streams. Work will 
proceed on using these sound streams to mimic infants'
abilities to ignore language dependent irrelevances. By 
the time we get to elementary speech we will therefore 
have a system able to work in noisy environments and 
accustomed to multiple speakers with varying accents. 

Sound perception will rely on three high quality 
microphones. (Although the human head uses only two 
auditory inputs, it relies heavily on the shape of the 
external ear in determining the vertical component of a 
sound source's direction.) Sound generation will be 
accomplished using a speaker. 
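
One common way to obtain a direction from a pair of microphones is to find 
the time delay that best aligns their signals and convert it to a bearing. 
The sketch below shows this for a single pair; it is a minimal illustration 
under stated assumptions, not the method the project commits to. The function 
names, the 44.1 kHz sample rate, the 0.2 m microphone spacing, and the use of 
a plain cross-correlation are all hypothetical. 

```python
import math


def best_lag(left, right, max_lag):
    """Return the lag in samples at which `right` best matches `left`.
    A positive lag means the sound reached the right microphone later."""
    best, best_score = 0, float("-inf")
    n = len(left)
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def bearing_degrees(lag, sample_rate_hz, mic_spacing_m, speed_of_sound=343.0):
    """Convert a lag into a far-field bearing relative to straight ahead."""
    path_difference = speed_of_sound * lag / sample_rate_hz
    ratio = max(-1.0, min(1.0, path_difference / mic_spacing_m))
    return math.degrees(math.asin(ratio))


# Hypothetical test signal: a single click arriving 3 samples later on the right.
left = [0.0] * 64
right = [0.0] * 64
left[10] = 1.0
right[13] = 1.0
lag = best_lag(left, right, max_lag=8)
angle = bearing_degrees(lag, sample_rate_hz=44100, mic_spacing_m=0.2)
```

A single pair only constrains the bearing in one plane; with three 
microphones, the delays between different pairs could be combined to resolve 
more of the direction. 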

Sound is critical for several aspects of the robot's ac- 
tivity. First, sound provides immediate feedback for mo- 
tor manipulation and positioning. Babies learn to find 
and use their hands by batting at and manipulating toys 
that jingle and rattle. Adults use such cues as contact 
noises — the sound of an object hitting the table — to pro- 
vide feedback to motor systems. Second, sound aids 
in socialization even before the emergence of language. 
Patterns such as turn-taking and mimicry are critical 
parts of children's development, and adults use guttural 
gestures to express attitudes and other conversational 
cues. Certain signal tones indicate encouragement or 
disapproval to all ages and stages of development. Fi- 
nally, even pre-verbal children use sound effectively to 
convey intent; until our robots develop true language, 
other sounds will necessarily be a major source of com- 
munication. 

5.2.3 Torsos 

In order for the humanoid to be able to participate in 
the same sorts of body metaphors as are used by humans, 
it needs to have a symmetric human-like torso. It needs 
to be able to experience imbalance, feel symmetry, learn 
to coordinate head and body motion for stable vision, 
and be able to experience relief when it relaxes its body. 
Additionally, the torso must be able to support the head, 
the arms, and any objects they grasp. 

The torsos we build will initially have a three degree 
of freedom hip, with the axes passing through a common 
point, capable of leaning and twisting to any position in 
about three seconds — somewhat slower than a human. 
The neck will also have three degrees of freedom, with 
the axes passing through a common point which will also 
lie along the spinal axis of the body. The head will be 
capable of yawing at 90 degrees per second — less than 
peak human speed, but well within the range of natural 
human motions. As we build later versions we expect to 
increase these performance figures to more closely match 
the abilities of a human. 

Apart from the normal sorts of kinematic sensors, the 
torso needs a number of additional sensors specifically 
aimed at providing input fodder for the development of 
bodily metaphors. In particular, strain gauges on the 
spine can give the system a feel for its posture and the 
symmetry of a particular configuration, plus a little in- 
formation about any additional load the torso might bear 
when an arm picks up something heavy. Heat sensors on 
the motors and the motor drivers will give feedback as 
to how much work has been done by the body recently, 
and current sensors on the motors will give an indication 
of how hard the system is working instantaneously. 

Our zero-th level torso is roughly 18 inches from the 
base of the spine to the base of the neck. This corre- 
sponds to a smallish adult. It uses DC motors with built 
in gearboxes. The main concern we have is how quiet it 
will be, as we do not want the sound perception system 
to be overwhelmed by body noise. 

Later versions of the torsos will have touch sensors 
integrated around the body, will have more compliant 
motion, will be quieter, and will need to provide better 
cabling ducts so that the cables can all feed out through 
a lower body outlet. 

5.2.4 Arms 

The eventual manipulator system will be a compliant 
multi-degree of freedom arm with a rather simple hand. 
(A better hand would be nice, but hand research is not 
yet at a point where we can get an interesting, easy-to-use, 
off-the-shelf hand.) The arm will be safe enough 
that humans can interact with it, handing it things and 
taking things from it. The arm will be compliant enough 
that the system will be able to explore its own body — 
for instance, by touching its head system — so that it will 
be able to develop its own body metaphors. The full 
design of even the first pair of arms is not yet 
completely worked out, and current funding does not permit 
the inclusion of arms on the zero-th level humanoid. In 
this section, we describe our desiderata for the arms and 
hands. 

We want the arms to be very compliant yet still able 
to lift weights of a few pounds so that they can interact 
with human artifacts in interesting ways. Additionally 
we want the arms to have redundant degrees of freedom 
(rather than the six seen in a standard commercial robot 
arm), so that in many circumstances we can 'burn' some 
of those degrees of freedom in order to align a single 
joint so that the joint coordinates and task coordinates 
very nearly match. This will greatly simplify control of 
manipulation. It is the sort of thing people do all the 
time: for example, when bracing an elbow or the base 
of the palm (or even their middle and last two fingers) 
on a table to stabilize the hand during some delicate (or 
not so delicate) manipulation. 

The hands in the first instances will be quite simple: 
devices that can grasp from above, relying heavily on 
mechanical compliance — they may have as few as one 
degree of control freedom. 

More sophisticated, however, will be the sensing on 
the arms and hands. We will use forms of conductive 
rubber to get a sense of touch over the surface of the 
arm, so that it can detect (compliant) collisions it might 
participate in. As with the torso there will be liberal use 
of strain gauges, heat sensors and current sensors so that 
the system can have a 'feel' for how its arms are being 
used and how they are performing. 

We also expect to move towards a more sophisticated 
type of hand in later years of this project. Initially, un- 
fortunately, we will be forced to use motions of the upper 
joints of the arm for fine manipulation tasks. More so- 
phisticated hands will allow us to use finger motions, 
with much lower inertias, to carry out these tasks. 

6 Development Plan 

We plan on modeling the brain at a level above the neural 
level, but below what would normally be thought of as 
the cognitive level. 

We understand abstraction well enough to know how 
to engineer a system that has similar properties and con- 
nections to the human brain without having to model its 
detailed local wiring. At the same time it is clear from 
the literature that there is no agreement on how things 
are really organized computationally at higher or modu- 
lar levels, or indeed whether it even makes sense to talk 
about modules of the brain (e.g., short term memory 
and long term memory) as generative structures. 

Nevertheless, we expect to be guided, or one might 
say inspired, by what is known about the high level con- 
nectivity within the human brain (although admittedly 
much of our knowledge actually comes from macaques 
and other primates and is only extrapolated to be true 
of humans, a problem of concern to some brain scien- 
tists (Crick & Jones 1993)). Thus for instance we ex- 
pect to have identifiable clusters of processors which we 
will be able to point to and say they are performing a 
role similar to that of the cerebellum (e.g., refining gross 
motor commands into coordinated smooth motions), or 
the cortex (e.g., some aspects of searching generaliza- 
tion/specialization hierarchies in object recognition (Ull- 
man 1991)). 

At another level we will directly model human sys- 
tems where they are known in some detail. For instance 
there is quite a lot known about the control of eye move- 
ments in humans (again mostly extrapolated from work 
with monkeys) and we will build in a vestibulo-ocular 
response (VOR), optokinetic response (OKR), smooth pursuit, 
and saccades using the best evidence available on how 
this is organized 
in humans (Lisberger 1988). 
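
As a highly idealized illustration of what a vestibulo-ocular response does, 
the sketch below counter-rotates the eye against the measured head velocity 
so that the gaze direction stays fixed in the world. The function names, the 
single scalar `gain`, and the 1 ms time step are assumptions made for 
illustration; the source does not describe the model that will be used. The 
90 degrees per second head yaw is the figure quoted for the neck in the 
torso description. 

```python
def vor_eye_velocity(head_velocity_dps, gain=1.0):
    """Idealized VOR: command an eye velocity opposite to the head velocity."""
    return -gain * head_velocity_dps


def simulate_gaze(head_velocity_dps, duration_s, dt=0.001, gain=1.0):
    """Integrate head and eye angles; return the final gaze angle (degrees).
    Gaze is the head angle plus the eye-in-head angle."""
    head = 0.0
    eye = 0.0
    steps = int(round(duration_s / dt))
    for _ in range(steps):
        head += head_velocity_dps * dt
        eye += vor_eye_velocity(head_velocity_dps, gain) * dt
    return head + eye


# Head yawing at 90 degrees per second for half a second.
stable = simulate_gaze(90.0, 0.5, gain=1.0)    # gaze stays where it started
drifting = simulate_gaze(90.0, 0.5, gain=0.9)  # an imperfect gain lets gaze drift
```

With a gain below one the gaze drifts in the direction of the head motion, 
which is the kind of residual error that visually driven responses such as 
OKR and smooth pursuit can help correct. 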



A third level of modeling or inspiration that we will 
use is at the developmental level. For instance once 
we have some sound understanding developed, we will 
use models of what happens in child language develop- 
ment to explore ways of connecting physical actions in 
the world to a ground of language and the development 
of symbols (Bates 1979), (Bates, Bretherton & Snyder 
1988), including indexical (Lempert & Kinsbourne 1985) 
and turn-taking behavior, interpretation of tone and fa- 
cial expressions and the early use of memorized phrases. 

Since we will have a number of faculty, post-doctoral 
fellows, and graduate students working on concurrent 
research projects, and since we will have a number of 
concurrently active humanoid robots, not all pieces that 
are developed will be intended to fit together exactly. 
Some will be incompatible experiments in alternate ways 
of building subsystems, or putting them together. Some 
will be pushing on particular issues in language, say, that 
may not be very related to some particular other issues, 
e.g., saccades. Also, quite clearly, at this stage we 
cannot have a development plan fully worked out for five 
years, as many of the early results will change the way 
we think about the problems and what should be the 
next steps. 

In figure 1, we summarize our current plans for devel- 
oping software systems on board our series of humanoids. 
In many cases there will be earlier work off-board the 
robots, but to keep clutter down in the diagram we have 
omitted that work here. 
[Figure 1 (diagram; only its labels survive). Labels: System software (0th); 
System software (commercial processor); Peripheral motion; Vergence; 
Ullman-esque; Physical schema; VOR; Smooth pursuit; Face pop-outs; Face 
remembering; Face recognition; Head/eye coord; Head/body/eye coord; Gesture 
recognition; Facial gesture recog.; Body motion recog.; Own hand tracking; 
Specific obj. recog.; Generic object recog.; Bring hands midline; Hand 
linking; Grasping & transfer; Body-based metaphors.] 